feat: Add HiDream-O1 transformer and image generation pipeline #13749
chinoll wants to merge 10 commits into
@claude could you do a review?
Claude finished @sayakpaul's task in 3m 44s
Review of HiDream-O1 Transformer & Pipeline PR
Thanks for the contribution! This is a solid start for integrating HiDream-O1 into diffusers, and the parity testing approach is commendable. There are several architectural issues that need to be addressed before this can be merged, mostly around aligning with diffusers conventions.
Blocking Issues
1. Attention must use
Thanks for the review. I pushed follow-up changes up to Summary by review item:
Validation run locally:
I also smoke-tested Dev bf16 generation on a CUDA machine at 1024x1024 with both the official Dev timestep schedule and the default 28-step schedule; both generated images successfully.
@claude review |
yiyixuxu left a comment
Thanks for opening the PR.
I have two pieces of high-level feedback:
- Can you host the transformer in the transformers library using remote code?
- Can you host the pipeline in modular diffusers only? The documentation is here: https://huggingface.co/docs/diffusers/main/en/modular_diffusers/overview We also have pretty good docs for AI agents on this, so they should have a pretty good idea of how to turn it into a modular one.
@yiyixuxu thanks, pushed
This update moves HiDream-O1 to modular diffusers only: the classic pipeline/docs/tests were removed, and I did not move the transformer to Transformers remote code in this PR. O1 is Qwen3-VL-style, but its image-generation path is not compatible with plain
Validation: modular pipeline + transformer tests pass (
)
class HiDreamO1ForConditionalGeneration(Qwen3VLPreTrainedModel, GenerationMixin):
Here you subclass the transformers Qwen3VL model and then monkey-patch it. Diffusers models always inherit from `ModelMixin`, and most of the layers are defined in the same file, with the exception of some small, common norm/embedding layers that we import from a common place.
We really do not want a part-diffusers, part-transformers model in diffusers.
I'm not asking you to upstream it into transformers; my ask was to package the custom modeling code as remote code on the Hub so it can be picked up using AutoModel.from_pretrained(..., trust_remote_code=True).
You can host the remote checkpoint on your personal Hub account for the PR. We could either PR into the HiDream Hub repo or host it under a diffusers org once we merge the PR.
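A minimal sketch of what picking up such a checkpoint could look like, assuming the custom modeling code has been pushed to a Hub repo. The helper name and the repo id in the comment are placeholders for illustration, and whether AutoModel is the right auto class depends on how the remote code registers itself.

```python
def load_remote_transformer(repo_id: str):
    """Load a checkpoint whose modeling code ships inside the Hub repo itself."""
    # Imported lazily so that defining this helper does not require transformers.
    from transformers import AutoModel

    # trust_remote_code=True lets transformers download and execute the
    # repo's own modeling file, so only use it with repos you trust.
    return AutoModel.from_pretrained(repo_id, trust_remote_code=True)


# Hypothetical repo id, for illustration only:
# model = load_remote_transformer("your-username/hidream-o1-remote-code")
```

With this arrangement the diffusers side would only need to reference the loaded model as a component, rather than carrying a part-transformers model class of its own.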
return outputs.sample[0, sample["vinput_mask"][0]].unsqueeze(0)
class HiDreamO1PromptSampleStep(ModularPipelineBlocks):
I think we should refactor these blocks into the standard files such as before_denoise.py, denoise.py, etc. as follows:
- before_denoise.py: HiDreamO1SetTimestepsStep, HiDreamO1PromptSampleStep, HiDreamO1PrepareImageNoiseStep
- denoise.py: HiDreamO1DenoiseStep
- decoders.py: HiDreamO1DecodeStep
This is the standard modular design and makes the code easier to follow, especially if we want to add more blocks in the future. See the Flux 2 modular pipeline for a concrete example.
return components, state
class HiDreamO1DenoiseStep(ModularPipelineBlocks):
I think we should refactor this into several blocks, with the denoising loop part being implemented as a LoopSequentialPipelineBlocks subclass, following the standard denoise step modular design. Concretely, this could look something like:
- HiDreamO1DenoiseStep: defines how the denoising blocks compose together. Subclasses the loop wrapper class below. See Flux2DenoiseStep for an example: diffusers/src/diffusers/modular_pipelines/flux2/denoise.py, lines 464 to 466 in 7aa746c
- HiDreamO1DenoiseLoopWrapper: inherits from LoopSequentialPipelineBlocks, handles the denoising loop and progress bar. See Flux2DenoiseLoopWrapper for an example: diffusers/src/diffusers/modular_pipelines/flux2/denoise.py, lines 406 to 407 in 7aa746c
- HiDreamO1LoopDenoiser: handles the loop step logic, including guidance (using a guider). See Flux2LoopDenoiser for an example: diffusers/src/diffusers/modular_pipelines/flux2/denoise.py, lines 45 to 46 in 7aa746c
- HiDreamO1AfterDenoiser: handles the scheduler step call. See Flux2LoopAfterDenoiser for an example: diffusers/src/diffusers/modular_pipelines/flux2/denoise.py, lines 362 to 363 in 7aa746c
- HiDreamO1BeforeDenoiser (if necessary): handles any preparation needed before the denoising loop. See WanLoopBeforeDenoiser for an example: diffusers/src/diffusers/modular_pipelines/wan/denoise.py, lines 37 to 38 in 7aa746c
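To make the proposed split concrete, here is a toy sketch of the composition pattern in plain Python. It deliberately does not use the diffusers classes: every Toy* name is a stand-in invented for illustration, the arithmetic is a placeholder for the model and scheduler calls, and the real LoopSequentialPipelineBlocks interface differs.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class LoopState:
    """Mutable state shared by every sub-block in the loop (toy stand-in)."""
    latents: float
    timesteps: List[float]
    noise_pred: float = 0.0
    log: List[str] = field(default_factory=list)


class ToyLoopDenoiser:
    """Per-step model call; a real block would run the transformer and guider."""
    def __call__(self, state: LoopState, i: int, t: float) -> LoopState:
        state.noise_pred = state.latents * t  # placeholder "prediction"
        state.log.append(f"denoise@{i}")
        return state


class ToyLoopAfterDenoiser:
    """Per-step update; a real block would call the scheduler's step method."""
    def __call__(self, state: LoopState, i: int, t: float) -> LoopState:
        state.latents = state.latents - 0.5 * state.noise_pred
        state.log.append(f"step@{i}")
        return state


class ToyDenoiseLoopWrapper:
    """Owns the loop over timesteps and runs its sub-blocks in order."""
    sub_blocks: list = []

    def __call__(self, state: LoopState) -> LoopState:
        for i, t in enumerate(state.timesteps):
            for block in self.sub_blocks:
                state = block(state, i, t)
        return state


class ToyDenoiseStep(ToyDenoiseLoopWrapper):
    """Only declares which sub-blocks the loop is composed of."""
    sub_blocks = [ToyLoopDenoiser(), ToyLoopAfterDenoiser()]


final = ToyDenoiseStep()(LoopState(latents=1.0, timesteps=[1.0, 0.5]))
```

The point of the pattern is that the loop, the model call, and the scheduler update each live in their own small block, so a new block (for example a before-denoiser) can be added by extending the sub-block list rather than editing the loop body.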
from .utils import PATCH_SIZE
class HiDreamO1Patchifier(ConfigMixin):
I think it is better to implement the pack_image/unpack_image methods inline where they are used (HiDreamO1PrepareImageNoiseStep and HiDreamO1DecodeStep) rather than defining a separate patchifier component.
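As a rough illustration of what the inlined pack/unpack logic does, here is a toy sketch on a single-channel image stored as nested lists. The function signatures, the row-major patch order, and the single-channel layout are assumptions made for this example; the PR's actual tensor layout (batch and channel dimensions, token order) is not shown in this thread.

```python
from typing import List

Image = List[List[float]]  # toy single-channel image: H rows of W values


def pack_image(image: Image, patch: int) -> List[List[float]]:
    """Split an H x W image into flattened, non-overlapping patch x patch tokens."""
    height, width = len(image), len(image[0])
    if height % patch or width % patch:
        raise ValueError("image sides must be divisible by the patch size")
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append(
                [image[top + dy][left + dx] for dy in range(patch) for dx in range(patch)]
            )
    return tokens


def unpack_image(tokens: List[List[float]], height: int, width: int, patch: int) -> Image:
    """Inverse of pack_image: write each token back to its patch location."""
    image = [[0.0] * width for _ in range(height)]
    patches_per_row = width // patch
    for index, token in enumerate(tokens):
        top = (index // patches_per_row) * patch
        left = (index % patches_per_row) * patch
        for offset, value in enumerate(token):
            image[top + offset // patch][left + offset % patch] = value
    return image


image = [[float(4 * r + c) for c in range(4)] for r in range(4)]
tokens = pack_image(image, patch=2)
restored = unpack_image(tokens, height=4, width=4, patch=2)
```

Because the two functions are exact inverses, a round-trip check like the one above is a cheap test to keep next to the prepare-noise and decode blocks once the logic is inlined.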
TIMESTEP_TOKEN_NUM = 1
PATCH_SIZE = 32
FULL_NOISE_SCALE = 8.0
T_EPS = 0.001
I think we don't need a PATCH_SIZE constant, as we can read the patch size from the transformer component via transformer.config.patch_size. For the other constants, I think we should put them in the files where they are used, or inline them if they are used as an InputParam default (e.g. FULL_NOISE_SCALE).
InputParam(
"noise_scale_start",
type_hint=float,
default=FULL_NOISE_SCALE,
Suggested change:
- default=FULL_NOISE_SCALE,
+ default=8.0,
Inline FULL_NOISE_SCALE as suggested in #13749 (comment)
]
def find_closest_resolution(width: int, height: int) -> tuple[int, int]:
I think we should inline the helper functions here which are only used once (such as find_closest_resolution) so it's easier to follow the code.
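For a helper this small, the inlined form can be a single expression at the call site. The sketch below assumes "closest" means nearest aspect ratio and uses a made-up bucket list; neither the PR's real selection criterion nor its resolution list appears in this thread.

```python
# Hypothetical resolution buckets, for illustration only.
EXAMPLE_RESOLUTIONS = [(1024, 1024), (1280, 768), (768, 1280)]

requested_width, requested_height = 1920, 1080

# Inlined replacement for a one-off helper: pick the bucket whose
# width / height ratio is nearest to the requested ratio.
target_ratio = requested_width / requested_height
width, height = min(
    EXAMPLE_RESOLUTIONS,
    key=lambda wh: abs(wh[0] / wh[1] - target_ratio),
)
```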
dg845 left a comment
Thanks for the PR! Left an initial design review for the modular pipeline :).
What does this PR do?
This PR adds Diffusers support for HiDream-O1 image generation.
HiDream-O1 is a Qwen3-VL based image generation model that denoises raw RGB image patches directly.
Unlike HiDream-I1 and most image diffusion pipelines, it does not use a VAE component.
This PR adds:
- HiDreamO1Transformer2DModel, a ModelMixin/ConfigMixin wrapper for HiDream-O1 checkpoints.
- HiDreamO1AttnProcessor, a dedicated attention processor for the HiDream-O1 two-pass attention path.
- HiDreamO1ImagePipeline, a text-to-image pipeline for raw RGB patch denoising.
- scripts/generate_hidream_o1_image.py.
Original implementation and checkpoints:
Notes
HiDream-O1 does not use a VAE. The pipeline prepares Qwen3-VL chat inputs, constructs O1 multimodal RoPE positions, denoises patchified RGB noise, and unpatchifies the final tensor into image space.
The transformer can also be loaded independently:
The pipeline can be loaded with:
For the dev checkpoint:
Tests
Result:
I also ran real image generation tests with the full and dev checkpoints in bfloat16, including multiple aspect ratios.
Who can review?
@yiyixuxu, @asomoza, @sayakpaul
Generate Image